Kernel methods for deconvolution have attractive features, and 
prevail in the literature. However, they have disadvantages, which in- 
clude the fact that they are usually suitable only for cases where the 
error distribution is infinitely supported and its characteristic func- 
tion does not ever vanish. Even in these settings, optimal convergence 
rates are achieved by kernel estimators only when the kernel is chosen 
to adapt to the unknown smoothness of the target distribution. In 
this paper we suggest alternative ridge methods, not involving ker- 
nels in any way. We show that ridge methods (a) do not require the 
assumption that the error-distribution characteristic function is non- 
vanishing; (b) adapt themselves remarkably well to the smoothness of 
the target density, with the result that the degree of smoothness does 
not need to be directly estimated; and (c) give optimal convergence 
rates in a broad range of settings. 

1. Introduction. Density estimation with observation error is almost al- 
ways based on kernel methods, where the kernel depends on the error dis- 
tribution. See, for example, the early contributions of Carroll and Hall [4], 
Liu and Taylor [22], Stefanski [27], Stefanski and Carroll [28], Zhang [30] 
and Fan [12, 13, 14]. More recent work, which also surveys earlier research, 
includes that of van Es, Spreij and van Zanten [29], Delaigle and Gijbels 
[8, 9, 10], Meister [23, 24] and Comte, Rozenholc and Taupin [6, 7]. 

However, kernel methods have disadvantages, not least being the fact that 
for effective implementation they require the characteristic function of the 
error distribution to have no zeros on the real line. In particular, the error 
distribution should not be compactly supported. 

Motivated partly by this difficulty, in the present paper we introduce a new 
estimation procedure based on ridging. Since this technique does not involve 
a kernel, the optimal choice of which depends on the unknown smoothness 
of the target density, our new method can have a relatively high degree 
of adaptivity. It does not require the regularity of the target function to 
be known in advance, and admits elementary cross-validation approaches to 
smoothing-parameter choice. See Fan and Koo [16] for discussion of adaptive 
methods in the setting of wavelet-based deconvolution. 

Importantly, ridging allows us to work with error distributions that have 
non-positive characteristic functions. In particular, using the new method 
we can readily treat problems where the error distribution is compactly 
supported. Ridging also eliminates discontinuities in integrals, which occur 
(for example) when using the sinc kernel, and so avoids the need for tapers. 

The ridge-based method enjoys optimal convergence rates, both in stan- 
dard, or "nonoscillatory," cases where characteristic functions do not vanish, 
and in "oscillatory" cases where those functions have infinitely many zeros. 
In the latter setting, neither convergence rates nor optimality properties have 
been established before. The rates turn out to be particularly interesting. 

For example, although as a rule optimal rates depend on the smoothnesses 
of both the target and the error distributions, for all sufficiently smooth 
target distributions, they depend only on the smoothness of the error dis- 
tribution. This property is not observed in the nonoscillatory case. It makes 
choice of the ridge parameter remarkably straightforward; the parameter can 
be chosen quite easily, within a very wide range, without adversely affecting 
the rate of convergence. 

Ridging is also adaptable to errors-in-variables problems, where it enjoys 
similar advantages. 

2. Methodology. Suppose we observe data $W_1, \ldots, W_n$ generated by the 
model 

(2.1)    $W_j = X_j + \delta_j$, 

where $X_j, \delta_j$, for $1 \le j < \infty$, are mutually independent random variables, 
the sequences $X_1, X_2, \ldots$ and $\delta_1, \delta_2, \ldots$ are identically distributed, and $\delta_j$ 
has known density $f_\delta$. We wish to estimate the density, $f_X$, say, of $X$. 
Conventional estimators in this problem are given by 

(2.2)    $\check f_X(x) = \frac{1}{nh} \sum_{j=1}^{n} L\Big(\frac{x - W_j}{h}\Big)$, 

where 

(2.3)    $L(u) = \frac{1}{2\pi} \int e^{-itu}\, \frac{K^{\mathrm{ft}}(t)}{f_\delta^{\mathrm{ft}}(t/h)}\, dt$, 

$K$ is a kernel function, $K^{\mathrm{ft}}(t) = \int e^{itx} K(x)\, dx$ is its Fourier transform, $h > 0$ 
is a bandwidth and we have $K^{\mathrm{ft}}(0) = 1$. Usually, $K$ is chosen so that $K^{\mathrm{ft}}$ is 
compactly supported, but in order for good convergence rates to be achieved, 
it must also take account of the unknown smoothness of the distribution 
of $W$. In particular, if $f_W$ has $d$ derivatives, then $K$ should be chosen so 
that $K^{\mathrm{ft}}(t) = 1 + O(|t|^d)$ as $t \to 0$. See, for example, Fan [12]. 

Thus, effective choice of $K$ requires difficult, adaptive estimation of the 
smoothness of $f_X$. This problem can be alleviated by employing the sinc 
kernel, $K(x) = (\sin x)/(\pi x)$; then, $K^{\mathrm{ft}}(t) = 1$ on the interval $[-1, 1]$ and 
vanishes elsewhere. However, in such cases the integral at (2.3) stops abruptly, 
before the integrand has a chance to descend to zero. That causes oscilla- 
tions of Gibbs phenomenon type in the final estimator; the oscillations can 
usually only be removed by fitting a taper to the integrand. An approach of 
this type has recently been discussed by Butucea and Tsybakov [3]. 

A particularly significant difficulty with kernel methods arises when $f_\delta^{\mathrm{ft}}$ 
vanishes at one or more points on the real line. Then there are poles in the 
integral at (2.3). Typically, the integrand behaves like a nonzero constant 
multiple of $(t - p)^{-1}$ in the neighborhood of a pole at $p$, and so the integral 
does not exist. 

To avoid having to address this problem, it is customary in the literature 
to assume that $f_\delta^{\mathrm{ft}}$ does not vanish anywhere. However, that constraint 
excludes all the conventional compactly supported models for the distribution 
of $\delta$, such as uniform and beta distributions, as well as some infinitely 
supported models. Devroye [11] shows that consistency is achievable whenever 
the set $\{t : f_\delta^{\mathrm{ft}}(t) = 0\}$ has Lebesgue measure zero. The underlying 
estimator requires selection of three parameter sequences, of which one excises a 
neighborhood of each pole from the integral in (2.3). This technique is 
arguably not attractive from a practical viewpoint. Another approach, where 
the condition $f_\delta^{\mathrm{ft}}(t) \ne 0$ is relaxed, is suggested by Groeneboom and 
Jongbloed [18]. However, their method is restricted to the case where $\delta$ has a 
uniform distribution. 

The reason for introducing the factor $K^{\mathrm{ft}}(t)$ to the integrand at (2.3) is 
to avoid difficulties when $t$ is relatively large and the denominator, $f_\delta^{\mathrm{ft}}(t)$, 
is small. We suggest ridging the integrand instead. This is not completely 
straightforward, since the denominator in (2.3) might be negative. We 
propose overcoming this problem via an indirect approach, which involves first 
making the denominator positive and then inserting the ridge, as follows. 
Note that, if we multiply both the numerator and denominator at (2.3) 
by $f_\delta^{\mathrm{ft}}(-t)$, we convert the denominator to $|f_\delta^{\mathrm{ft}}(t)|^2$, a real-valued and 
nonnegative function. This suggests taking the integral in (2.3) over the whole 
real line, and incorporating a positive ridge function, $h(t)$, say, generally 
depending on $n$ and sometimes also on $t$. 
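
The effect of these two steps can be checked numerically. The following is a minimal Python sketch, not taken from the paper. It assumes a Uniform[-1, 1] error, whose characteristic function sin(t)/t vanishes at t = k*pi, and it uses the illustrative choices r = 2 and h(t) = 0.1 t^2; the function names are hypothetical. It compares the unregularized factor 1/f_delta^ft(t) with the ridged factor f_delta^ft(-t) |f_delta^ft(t)|^r / {|f_delta^ft(t)| v h(t)}^(r+2) near the zero at t = pi.

```python
import math


def uniform_cf(t):
    """Characteristic function of a Uniform[-1, 1] error: sin(t)/t."""
    return 1.0 if t == 0.0 else math.sin(t) / t


def plain_factor(t):
    """Unregularized inversion factor 1 / f_delta^ft(t); it blows up near zeros."""
    return 1.0 / uniform_cf(t)


def ridged_factor(t, r=2.0, ridge=lambda t: 0.1 * t * t):
    """f(-t) |f(t)|^r / max(|f(t)|, h(t))^(r+2); r and h are illustrative choices."""
    cf = uniform_cf(t)
    denom = max(abs(cf), ridge(t)) ** (r + 2.0)
    return uniform_cf(-t) * abs(cf) ** r / denom


for t in (1.0, math.pi - 1e-3, math.pi - 1e-6, math.pi + 1e-3):
    print(f"t={t:.6f}  plain={plain_factor(t): .3e}  ridged={ridged_factor(t): .3e}")
```

Away from the zeros, where |f_delta^ft(t)| exceeds h(t), the two factors coincide; close to t = pi the plain factor grows without bound while the ridged factor tends to zero.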
The last step may be implemented in a variety of ways, of which one is 
to use 

(2.4)    $\tilde f_X(x) = \frac{1}{2\pi} \int \frac{f_\delta^{\mathrm{ft}}(-t)\, |f_\delta^{\mathrm{ft}}(t)|^{r}\, \hat f_W^{\mathrm{ft}}(t)}{\{|f_\delta^{\mathrm{ft}}(t)| \vee h(t)\}^{r+2}}\, e^{-itx}\, dt$, 

where $\hat f_W^{\mathrm{ft}}(t) = n^{-1} \sum_j e^{itW_j}$ denotes the empirical characteristic function 
of $W$, $r \ge 0$ describes the "shape" of the smoothing regime, and $x \vee y$ 
represents the maximum of $x$ and $y$. In practice, we would confine attention to 
the real part of $\tilde f_X$, which we write as $\hat f_X = \Re \tilde f_X$. The connection between 
$\tilde f_X$, defined at (2.4), and a kernel-type estimator, at (2.2), can be seen 
by taking the kernel, $L$, in (2.2) to have Fourier transform $L^{\mathrm{ft}}$ given by 

(2.5)    $L^{\mathrm{ft}}(t) = \begin{cases} h^{-(r+2)} f_\delta^{\mathrm{ft}}(-t)\, |f_\delta^{\mathrm{ft}}(t)|^{r}, & \text{if } |f_\delta^{\mathrm{ft}}(t)| \le h, \\ 1/f_\delta^{\mathrm{ft}}(t), & \text{if } |f_\delta^{\mathrm{ft}}(t)| > h. \end{cases}$ 

In the nonoscillatory case, where $f_\delta^{\mathrm{ft}}$ does not vanish on the real line, we 
may take $h(t)$ equal to a constant depending on $n$. In particular, optimal 
convergence rates, for a wide range of different smoothnesses of $f_X$, are 
obtained when $h(t)$ does not depend on $t$. Therefore, in such settings our 
approach has removed the need to choose a kernel whose shape is adapted to 
the unknown smoothness of $f_X$. Even when $f_\delta^{\mathrm{ft}}$ vanishes at points on the line, 
it is usually straightforward to determine $h(t)$; in such cases that function 
is a constant multiplied by a power of $|t|$, and the power can generally be 
obtained from knowledge of $f_\delta$. 

Note too that the approach at (2.4) removes the need for tapers, and in 
fact, the integrand in (2.4) is a uniformly continuous function for each x. 

In order for (2.4) to be well defined, the integrability of $|f_\delta^{\mathrm{ft}}|^{r+1}$ needs to be 
assured. This is straightforward, however. Indeed, if $f_\delta$ is square-integrable, 
then $r \ge 1$ is sufficient. 
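
As an illustration of how (2.4) might be evaluated in practice, here is a minimal Python sketch. It is not the authors' implementation: the integral is truncated to [-t_max, t_max] and approximated by the trapezoidal rule, and the Laplace error scale, the constant ridge h(t) = 0.5, the sample size and all function names are assumptions made only for this example.

```python
import cmath
import math
import random


def laplace_cf(t, scale=0.5):
    """Characteristic function of a centred Laplace error (never vanishes)."""
    return 1.0 / (1.0 + (scale * t) ** 2)


def ridge_density(x_grid, w, error_cf, ridge, r=2.0, t_max=20.0, n_t=801):
    """Real part of the ridge estimator (2.4), via the trapezoidal rule.

    `ridge` maps t to h(t) > 0. The truncation point `t_max` and grid size
    `n_t` are numerical conveniences of this sketch, not part of the method.
    """
    n = len(w)
    dt = 2.0 * t_max / (n_t - 1)
    ts = [-t_max + k * dt for k in range(n_t)]
    phi = []  # integrand of (2.4) without the factor exp(-itx)
    for k, t in enumerate(ts):
        ecf = sum(cmath.exp(1j * t * wj) for wj in w) / n  # empirical c.f. of W
        cf = error_cf(t)
        denom = max(abs(cf), ridge(t)) ** (r + 2.0)
        quad_w = 0.5 if k in (0, n_t - 1) else 1.0  # trapezoid end weights
        phi.append(quad_w * error_cf(-t) * abs(cf) ** r * ecf / denom)
    out = []
    for x in x_grid:
        total = sum(p * cmath.exp(-1j * t * x) for p, t in zip(phi, ts))
        out.append((total * dt / (2.0 * math.pi)).real)
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    # X ~ N(0, 1); delta ~ Laplace(0.5), drawn as a difference of two exponentials.
    w = [rng.gauss(0.0, 1.0) + rng.expovariate(2.0) - rng.expovariate(2.0)
         for _ in range(400)]
    xs = [-3.0 + 0.5 * i for i in range(13)]
    est = ridge_density(xs, w, laplace_cf, ridge=lambda t: 0.5)
    for x, v in zip(xs, est):
        print(f"x={x:+.1f}  estimate={v:.3f}")
```

Passing `ridge=lambda t: zeta * abs(t) ** rho` instead gives a power-type ridge for an error distribution whose characteristic function has zeros.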

In "standard" errors-in-variables problems we combine the model (2.1) 
with a regression model, and observe data $(W_1, Y_1), \ldots, (W_n, Y_n)$, generated 
as 

(2.6)    $Y_j = g(X_j) + \epsilon_j$,    $W_j = X_j + \delta_j$, 

where $X_j, \delta_j, \epsilon_j$, for $1 \le j < \infty$, are mutually independent random variables, 
the variables $X_j$ and $\delta_j$ are as in the model (2.1) and $\epsilon_1, \epsilon_2, \ldots$ are identically 
distributed with $E(\epsilon_j) = 0$. We wish to estimate the smooth function $g$. 
There is a large literature on kernel methods in this problem; see Fan and 
Truong [17] for theory and Carroll, Ruppert and Stefanski [5] for a survey. 
A ridge-based estimator is given by $\hat g = \Re \tilde g$, where, for any $r \ge 0$, 

(2.7)    $\tilde g(x) = \frac{\int f_\delta^{\mathrm{ft}}(-t)\, |f_\delta^{\mathrm{ft}}(t)|^{r}\, \hat v(t) / \{|f_\delta^{\mathrm{ft}}(t)| \vee h(t)\}^{r+2}\, e^{-itx}\, dt}{\int f_\delta^{\mathrm{ft}}(-t)\, |f_\delta^{\mathrm{ft}}(t)|^{r}\, \hat f_W^{\mathrm{ft}}(t) / \{|f_\delta^{\mathrm{ft}}(t)| \vee h(t)\}^{r+2}\, e^{-itx}\, dt}$, 






again $\hat f_W^{\mathrm{ft}}(t) = n^{-1} \sum_j e^{itW_j}$, and $\hat v(t) = n^{-1} \sum_j Y_j e^{itW_j}$. 
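
The ratio at (2.7) can be sketched numerically in the same way as the density estimator. The Python fragment below is an illustrative sketch only: it uses a single ridge function for numerator and denominator, truncates both integrals to [-t_max, t_max], and assumes a Laplace error; the function names and default values are hypothetical. The common factor dt/(2 pi) cancels in the ratio and is therefore omitted.

```python
import cmath


def laplace_cf(t, scale=0.5):
    """Characteristic function of a centred Laplace error (assumed error law)."""
    return 1.0 / (1.0 + (scale * t) ** 2)


def ridge_regression(x, w, y, error_cf, ridge, r=2.0, t_max=20.0, n_t=801):
    """Real part of the ratio in (2.7) at a single point x (trapezoidal rule)."""
    n = len(w)
    dt = 2.0 * t_max / (n_t - 1)
    num = 0.0 + 0.0j
    den = 0.0 + 0.0j
    for k in range(n_t):
        t = -t_max + k * dt
        cf = error_cf(t)
        factor = error_cf(-t) * abs(cf) ** r / max(abs(cf), ridge(t)) ** (r + 2.0)
        if k in (0, n_t - 1):
            factor *= 0.5                               # trapezoid end weights
        e = [cmath.exp(1j * t * wj) for wj in w]
        f_w = sum(e) / n                                # empirical c.f. of W
        v = sum(yj * ej for yj, ej in zip(y, e)) / n    # hat v(t)
        kern = cmath.exp(-1j * t * x)
        num += factor * v * kern
        den += factor * f_w * kern
    return (num / den).real


w = [-1.0, -0.5, 0.0, 0.5, 1.0]
print(ridge_regression(0.0, w, [2.0] * 5, laplace_cf, ridge=lambda t: 0.5))
```

When every response equals the same constant, the numerator is that constant times the denominator, so the estimate reproduces the constant; this gives a quick sanity check on an implementation.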

In Berkson's errors-in-variables problem the observed data are $(X_1, Y_1)$, 
$\ldots, (X_n, Y_n)$, generated as 

    $Y_j = g(X_j + \delta_j) + \epsilon_j$, 

where, as before, $X_j, \delta_j, \epsilon_j$ are mutually independent for $1 \le j < \infty$, the $X_j$, 
$\delta_j$ and $\epsilon_j$ sequences are each identically distributed, $E(\epsilon_j) = 0$ and $\delta_j$ has 
known density $f_\delta$. 

There is apparently no published account of nonparametric methods in 
this problem, which dates from Berkson [1] and is generally addressed using 
parametric or semiparametric techniques (see, e.g., Reeves et al. [26] and 
Buonaccorsi and Lin [2]). Using our ridge-based approach, an estimator of 
g can be taken to be the real part of g, where 



2k J {|/J«(f)|Vf.(f)}'+ 2 ' 

$\hat w(t) = \sum_j D_j Y_j e^{itX_j}$ and $D_j > 0$ denotes the distance from $X_j$ to the nearest 
other data value. The function $\hat w(t)$ estimates $c^{\mathrm{ft}}(t)/f_\delta^{\mathrm{ft}}(t)$, where $c^{\mathrm{ft}}$ is the 
Fourier transform of the function $c(x) = \int f_\delta(x - u)\, E(Y \mid X = u)\, du$. 

3. Smoothing-parameter choice. When $f_\delta^{\mathrm{ft}}$ does not vanish on the real 
line and also in the case of supersmooth $f_\delta$, where $f_\delta^{\mathrm{ft}}$ decreases exponentially 
fast to zero in the tails, it is usually adequate to take $h = h(t)$ in (2.4) to be 
a constant depending on $n$. In this case, $h$ is the single smoothing parameter 
on which the methodology depends. In contexts where $f_\delta$ is ordinary-smooth 
(i.e., $f_\delta^{\mathrm{ft}}$ decreases only polynomially fast) and is compactly supported, we 
usually need to take $h$ to be a polynomial in $t$. Taking these cases together, 
we might consider 

(3.1)    $h(t) = h_n(t) = n^{-\zeta} |t|^{\rho}$, 

where $\zeta > 0$ and $\rho \ge 0$. 

Section 4 will discuss choices of $\zeta$ and $\rho$ that lead to optimal rates of 
convergence. The case where $f_\delta$ is compactly supported is particularly 
interesting; we treat it here through an example, as follows. If $f_\delta$ is the $\mu$-fold 
convolution of a symmetric uniform density, where $\mu \ge 2$, and if $f_X$ has an 
integrable second derivative, then optimal convergence rates are achieved 
with $\rho = 2$ and $\zeta = \frac12$ in (3.1). 

In this setting we might take $\rho = 2$ and choose the constant $\zeta$ in the 
ridge-parameter formula $h(t) = \zeta t^2$ empirically; in the context where $f_\delta^{\mathrm{ft}}$ 
does not vanish, we can take $\rho = 0$ and choose $\zeta$ in $h(t) = \zeta$ empirically. 
Therefore, we should address the case where $h(t) = \zeta |t|^{\rho}$, with $\rho \ge 0$ known, 
and discuss selection of $\zeta$. Our approach to solving this problem will be via 
cross-validation; see Hesse [20] for an account of this method in the case of 
kernel-based deconvolution. 

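Let $\tilde f_X(\zeta; x)$ denote the density estimator at (2.4) for this choice of the 
ridge. The aim is to minimize 

    $\int |\tilde f_X(\zeta; x) - f_X(x)|^2\, dx$, 

or equivalently, using the Plancherel identity, to minimize the function $J(\zeta) - 2\Re I(\zeta)$, 
where $J(\zeta) = \int |\tilde f_X(\zeta; x)|^2\, dx$ and 

    $I(\zeta) = \int \tilde f_X(\zeta; x)\, f_X(x)\, dx = \frac{1}{2\pi} \int \frac{f_\delta^{\mathrm{ft}}(-t)\, |f_\delta^{\mathrm{ft}}(t)|^{r}\, \hat f_W^{\mathrm{ft}}(t)}{\{|f_\delta^{\mathrm{ft}}(t)| \vee \zeta |t|^{\rho}\}^{r+2}}\, f_X^{\mathrm{ft}}(-t)\, dt$. 

While $J(\zeta)$ is known, we have to produce an accessible version of $I(\zeta)$ to 
eliminate the unknown $f_X^{\mathrm{ft}}$. That is given by 

    $\hat I(\zeta) = \frac{1}{2\pi n(n-1)} \int \frac{|f_\delta^{\mathrm{ft}}(t)|^{r}}{\{|f_\delta^{\mathrm{ft}}(t)| \vee \zeta |t|^{\rho}\}^{r+2}} \sum_{j \ne k} \exp\{it(W_j - W_k)\}\, dt$, 

and so we use as our criterion 

(3.2)    $\mathrm{CV}(\zeta) = J(\zeta) - 2\Re \hat I(\zeta)$. 

(Note that $r \ge 0$ has to be chosen sufficiently large to ensure the integrability 
of $|f_\delta^{\mathrm{ft}}|^{r}$, and that, provided $f_\delta$ is square-integrable, this requires only $r \ge 2$.) 
Finally, we select the smoothing parameter 

(3.3)    $\hat\zeta = \operatorname{argmin}_{\zeta}\, \mathrm{CV}(\zeta)$. 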

Sections 4 and 5 will briefly discuss theoretical and numerical properties, 
respectively, of this technique. 
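
A grid-search version of the criterion (3.2) and the selector (3.3) can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the authors' code: J is evaluated in the Fourier domain through the Plancherel identity, the double sum over j != k is computed as |sum_j exp(itW_j)|^2 - n, the integrals are truncated to [-t_max, t_max], and the Laplace error, candidate grid and function names are hypothetical choices.

```python
import cmath
import math
import random


def laplace_cf(t, scale=0.5):
    """Characteristic function of a centred Laplace error (assumed error law)."""
    return 1.0 / (1.0 + (scale * t) ** 2)


def cross_sum(t, w):
    """Sum over j != k of exp{it(W_j - W_k)}, i.e. |sum_j exp(itW_j)|^2 - n."""
    s = sum(cmath.exp(1j * t * wj) for wj in w)
    return abs(s) ** 2 - len(w)


def cv_criterion(zeta, w, error_cf, rho=0.0, r=2.0, t_max=20.0, n_t=401):
    """CV(zeta) = J(zeta) - 2 Re I_hat(zeta) for the ridge h(t) = zeta |t|^rho."""
    n = len(w)
    dt = 2.0 * t_max / (n_t - 1)
    total = 0.0
    for k in range(n_t):
        t = -t_max + k * dt
        a = abs(error_cf(t))
        d = max(a, zeta * abs(t) ** rho)
        cs = cross_sum(t, w)
        ecf_sq = (cs + n) / n ** 2                  # |empirical c.f. of W|^2
        j_part = a ** (2 * r + 2) * ecf_sq / d ** (2 * r + 4)
        i_part = a ** r * cs / (n * (n - 1)) / d ** (r + 2)
        q = 0.5 if k in (0, n_t - 1) else 1.0       # trapezoid end weights
        total += q * (j_part - 2.0 * i_part)
    return total * dt / (2.0 * math.pi)


def cv_select(candidates, w, error_cf, **kwargs):
    """Grid-search version of (3.3): the candidate that minimises CV."""
    return min(candidates, key=lambda z: cv_criterion(z, w, error_cf, **kwargs))


rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) + rng.expovariate(2.0) - rng.expovariate(2.0)
          for _ in range(100)]
print(cv_select([0.05, 0.1, 0.2, 0.4, 0.8], sample, laplace_cf))
```

For clarity the sketch recomputes the cross sums for every candidate; caching them on the t grid would avoid that cost, since they do not depend on the ridge.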

This approach has a straightforward analogue for determining smoothing 
parameters in the "standard" errors-in-variables problem (2.6). In that case 
we treat separately the numerator and denominator in (2.7), and so the 
estimator g is computed using two different smoothing parameters. The 
case of Berkson's errors-in-variables problem is more difficult to address, 
however. 

4. Theoretical properties. 

4.1. The nonoscillatory case. Here we consider the case where the 
characteristic function of $\delta$ does not vanish on the real line. Given $\beta > 1/2$, $C > 0$, 
define $\mathcal F^1_{\beta,C}$ to be the Sobolev class of all densities $f_X$ for which 

(4.1)    $\int |f_X^{\mathrm{ft}}(t)|^2 (1 + t^2)^{\beta}\, dt \le C$. 

Concerning the error density, $f_\delta$, we consider ordinary-smooth densities 

(4.2)    $C_1 (1 + t^2)^{-\nu} \le |f_\delta^{\mathrm{ft}}(t)|^2 \le C_2 (1 + t^2)^{-\nu}$,    for $-\infty < t < \infty$, 

with $\nu > 0$ and $0 < C_1 < C_2 < \infty$, as well as supersmooth densities 

(4.3)    $\exp(-c_1 |t|^{\gamma}) \le |f_\delta^{\mathrm{ft}}(t)| \le \exp(-c_2 |t|^{\gamma})$,    for $-\infty < t < \infty$, 

with $c_1 > c_2 > 0$ and $\gamma > 0$. 

We take the ridge function $h(t)$ to be a scalar, that is, $\rho = 0$ in the notation 
of (3.1). Let $E_{X\delta}$ denote expectation under the assumption that $X_1, \ldots, X_n$ 
and $\delta_1, \ldots, \delta_n$ have densities $f_X$ and $f_\delta$, respectively. Write $\|\cdot\|$ for the $L_2$ 
norm on the space of square-integrable, real-valued functions. 

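Theorem 4.1. In (3.1) take $\rho = 0$ and $\zeta$ as given below. (a) If $f_\delta$ 
satisfies (4.2), and if $r \ge 0 \vee (\nu^{-1} - 1)$ and $\zeta = \nu/(2\beta + 2\nu + 1)$, then the estimator 
$\hat f_X = \Re \tilde f_X$, with $\tilde f_X$ defined at (2.4), satisfies 

    $\sup_{f_X \in \mathcal F^1_{\beta,C}} E_{X\delta} \|\hat f_X - f_X\|^2 = O(n^{-2\beta/(2\beta + 2\nu + 1)})$. 

(b) If $f_\delta$ satisfies (4.3), and if $r = 0$ and $0 < \zeta < \frac14$, then 

    $\sup_{f_X \in \mathcal F^1_{\beta,C}} E_{X\delta} \|\hat f_X - f_X\|^2 = O\{(\log n)^{-2\beta/\gamma}\}$. 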

Optimality of these convergence rates follows from results of Fan [12, 15] 
for Hölder classes. Under additional regularity assumptions on $f_\delta$, they can 
be extended to Sobolev classes; see Neumann [25] and Hesse and Meister 
[21]. A proof of Theorem 4.1 is included in a longer version of this paper 
(Hall and Meister [19]). 

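4.2. The oscillatory case. Given $\beta > 1/2$, $C > 0$, define $\mathcal F^2_{\beta,C}$ and $\mathcal F^3_{\beta,C}$ to 
be, respectively, the classes of densities $f_X$ for which 

    $\int \{|f_X^{\mathrm{ft}}(t)|^2 + |(f_X^{\mathrm{ft}})'(t)|\, |f_X^{\mathrm{ft}}(t)|\} (1 + t^2)^{\beta}\, dt \le C$ 

and $|f_X^{\mathrm{ft}}(t)| \le C |t|^{-\beta - (1/2)}$. [In the case of $\mathcal F^2_{\beta,C}$ we assume that $(f_X^{\mathrm{ft}})'$ is well 
defined.] These conditions amount to upper bounds on the smoothness of 
$f_X$, alternative to that given by (4.1). To appreciate the connections among 
the classes $\mathcal F^j_{\beta,C}$, note that $\mathcal F^2_{\beta,C} \subseteq \mathcal F^1_{\beta,C}$, and that, provided 

(4.4)    $|(f_X^{\mathrm{ft}})'| / |f_X^{\mathrm{ft}}| \le C_X$ 

(which condition is typically true for Laplace-type distributions, for which 
$|f_X^{\mathrm{ft}}(t)|$ decreases in a polynomial way as $|t|$ increases), $f_X \in \mathcal F^1_{\beta,C}$ entails 
$f_X \in \mathcal F^2_{\beta,C_2}$, where $C_2 = C(1 + C_X)$. Also, if $f_X \in \mathcal F^3_{\beta,C}$ and $\epsilon \in (0, \beta - \frac12)$, 
then $f_X \in \mathcal F^1_{\beta-\epsilon,C_3}$, where 

    $C_3 = \int_{-1}^{1} (1 + t^2)^{\beta}\, dt + C^2 \int_{|t|>1} |t|^{-2\beta-1} (1 + t^2)^{\beta-\epsilon}\, dt$; 

and if $f_X \in \mathcal F^3_{\beta,C}$ and (4.4) holds, then $f_X \in \mathcal F^2_{\beta-\epsilon,C_4}$, where 

    $C_4 = (C_X + 1) \int_{-1}^{1} (1 + t^2)^{\beta}\, dt + C^2 (1 + C_X) \int_{|t|>1} |t|^{-2\beta-1} (1 + t^2)^{\beta-\epsilon}\, dt$. 

To give a more intuitive description of the densities in $\mathcal F^3_{\beta,C}$, we mention 
that, for integer $\beta + \frac12$, the relation $f_X \in \mathcal F^3_{\beta,C}$ follows if the derivatives $f_X^{(l)}(x)$ 
tend to zero as $|x| \to \infty$, for all $l < \beta + \frac12$, and $\int |f_X^{(\beta+1/2)}(x)|\, dx \le C$. Hence, 
$\mathcal F^3_{\beta,C}$ might be interpreted as an $L_1(\mathbb R)$-analogue of the Sobolev class $\mathcal F^1_{\beta,C}$. 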
Given $\mu \ge 1$, $\nu > 0$, $0 < C_1 < C_2 < \infty$, $\lambda > 0$ and $T > 0$, denote by $\mathcal G_{\mu,\nu,\lambda}$ 
the class of probability densities $f_\delta$ for which 

(4.5)    $C_1 |\sin(\lambda t)|^{\mu} |t|^{-\nu} \le |f_\delta^{\mathrm{ft}}(t)| \le C_2 |\sin(\lambda t)|^{\mu} |t|^{-\nu}$    for all $|t| > T$ 

and $f_\delta^{\mathrm{ft}}(t)$ does not vanish for $|t| \le T$. 

The parameter $\mu$ describes the "order" of the isolated zeros of $f_\delta^{\mathrm{ft}}$. Note 
that all self-convolved uniform densities are in $\mathcal G_{\mu,\nu,\lambda}$ for appropriate 
parameter choices, as too are their convolutions with any ordinary-smooth density. 
Accordingly, we introduce the class $\mathcal G'_{\mu,\gamma,\lambda}$ of all densities satisfying (4.5) 
when $|t|^{-\nu}$ is replaced by $\exp(-d|t|^{\gamma})$, with $d, \gamma > 0$. For instance, 
convolutions of uniform densities with normal densities are included in $\mathcal G'_{\mu,\gamma,\lambda}$ for 
suitably chosen parameters. 

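Preparing for Theorem 4.2(a), put 

(4.6)    $\rho \begin{cases} \in \big(\frac{\mu+\nu}{2\mu-1},\, 2\mu\beta - \nu\big], & \text{if } 2\beta + 2\nu + 1 < 4\mu\beta, \\ = \frac{\mu+\nu}{2\mu-1}, & \text{if } 2\beta + 2\nu + 1 = 4\mu\beta, \\ \in \big[2\mu\beta - \nu,\, \frac{\mu+\nu}{2\mu-1}\big), & \text{otherwise,} \end{cases}$ 

(4.7)    $\zeta = \begin{cases} \frac12, & \text{if } 2\beta + 2\nu + 1 \le 4\mu\beta, \\ \frac{\rho+\nu}{2\beta + 2\nu + 1}, & \text{otherwise.} \end{cases}$ 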



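Theorem 4.2. (a) Define $h(t)$ as at (3.1), with $\rho$ and $\zeta$ as in (4.6) and 
(4.7), respectively. If $f_\delta \in \mathcal G_{\mu,\nu,\lambda}$ where $\nu > 0$ and $\mu \ge 1$, and if $r \ge 0 \vee (\nu^{-1} - 1)$, 
then, for $\hat f_X = \Re \tilde f_X$ and $j = 2, 3$, we have 

(4.8)    $\sup_{f_X \in \mathcal F^j_{\beta,C}} E_{X\delta} \|\hat f_X - f_X\|^2 = \begin{cases} O(n^{-1/(2\mu)}), & \text{if } 2\beta + 2\nu + 1 < 4\beta\mu, \\ O(n^{-1/(2\mu)} \log n), & \text{if } 2\beta + 2\nu + 1 = 4\beta\mu, \\ O(n^{-2\beta/(2\beta + 2\nu + 1)}), & \text{otherwise.} \end{cases}$ 

(b) Define $h(t)$ as at (3.1), with $\rho = 0$ and $\zeta \in (0, \frac14)$. If $f_\delta \in \mathcal G'_{\mu,\gamma,\lambda}$, and 
if $r = 0$, then for $j = 1, 2, 3$, 

    $\sup_{f_X \in \mathcal F^j_{\beta,C}} E_{X\delta} \|\hat f_X - f_X\|^2 = O\{(\log n)^{-2\beta/\gamma}\}$. 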

To interpret part (a) of the theorem, note that for ordinary-smooth $f_\delta$, 
the non-classical rate $n^{-1/(2\mu)}$ arises if $\mu$ is large enough in relation to $\nu$ 
and $\beta$. Then the difficulty created by the isolated zeros of $f_\delta^{\mathrm{ft}}$ dominates the 
difficulty caused by the decay of $f_\delta^{\mathrm{ft}}$ in the tails. Part (b) implies that, unlike 
the case of ordinary-smooth densities, the existence of isolated zeros of $f_\delta^{\mathrm{ft}}$ 
does not cause any deterioration of convergence rates for supersmooth error 
densities, even for the comprehensive smoothness class $\mathcal F^1_{\beta,C}$. 
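
The case distinction in Theorem 4.2(a) can be tabulated with a short helper. The Python sketch below is illustrative only and the function name is hypothetical. It takes the two thresholds for rho to be (mu + nu)/(2 mu - 1) and 2 mu beta - nu, uses zeta = 1/2 when 2 beta + 2 nu + 1 <= 4 mu beta, and in the remaining case uses zeta = (rho + nu)/(2 beta + 2 nu + 1); that last value is an assumption of this sketch, chosen so that it reduces to the value of zeta in Theorem 4.1(a) when rho = 0. Clipping the lower end of the range at zero reflects the requirement rho >= 0 in (3.1).

```python
def ridge_exponents(beta, nu, mu, rho=None):
    """Return (rho_low, rho_high, zeta, rate_exponent) for the ridge n^(-zeta) |t|^rho.

    The mean integrated squared error is of order n**(-rate_exponent), up to a
    log n factor in the boundary case. `rho` is only used in the last case,
    where zeta depends on the value picked from the admissible range.
    """
    a = 2.0 * beta + 2.0 * nu + 1.0
    b = 4.0 * mu * beta
    left = (mu + nu) / (2.0 * mu - 1.0)
    right = 2.0 * mu * beta - nu
    if a < b:
        return left, right, 0.5, 1.0 / (2.0 * mu)
    if a == b:
        return left, left, 0.5, 1.0 / (2.0 * mu)   # extra log n factor in the rate
    low, high = max(right, 0.0), left
    if rho is None:
        rho = low
    zeta = (rho + nu) / a
    return low, high, zeta, 2.0 * beta / a


# Two-fold convolution of a uniform density (mu = nu = 2) with beta = 3/2.
print(ridge_exponents(1.5, 2.0, 2.0))
```

In the printed example the admissible range for rho contains the value 2, with zeta = 1/2 and rate exponent 1/4.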

Next we show that the convergence rates in Theorem 4.2(a) are optimal, 
or nearly optimal in the case $2\beta + 2\nu + 1 = 4\beta\mu$. 

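Theorem 4.3. Let $\hat f$ be an arbitrary estimator of $f_X$ based on the sample 
$W_1, \ldots, W_n$. Assume the existence of at least one $T > 0$ such that 

(4.9)    $\limsup_{t \to T} |f_\delta^{\mathrm{ft}}(t)| / |t - T|^{\mu} < \infty$    and    $\limsup_{t \to T} |(f_\delta^{\mathrm{ft}})'(t)| / |t - T|^{\mu-1} < \infty$. 

Then for $j = 2, 3$, 

(4.10)    $\limsup_{n \to \infty} \sup_{f_X \in \mathcal F^j_{\beta,C}} n^{1/(2\mu)} E_{X\delta} \|\hat f - f_X\|^2 > 0$. 

If, in addition, $f_\delta \in \mathcal G_{\mu,\nu,\lambda}$ and $|(f_\delta^{\mathrm{ft}})'(t)| \le c |t|^{-\nu}$ for some $c > 0$ and all $t$, 
and $C$ is sufficiently large, then 

(4.11)    $\limsup_{n \to \infty} \sup_{f_X \in \mathcal F^j_{\beta,C}} n^{2\beta/(2\beta + 2\nu + 1)} E_{X\delta} \|\hat f - f_X\|^2 > 0$. 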

Note that, for integer $\mu$, (4.9) is satisfied if $f_\delta^{\mathrm{ft}}$ is $\mu$-times continuously 
differentiable in a neighborhood of $T$, and $(f_\delta^{\mathrm{ft}})^{(0)}(T) = \cdots = (f_\delta^{\mathrm{ft}})^{(\mu-1)}(T) = 0$. 
Due to the greater stringency of the smoothness classes when $j = 2, 3$, the 
lower bound (4.11) does not follow from earlier results for $j = 1$. 

Note that (4.9), in contradistinction to (4.5), requires $f_\delta^{\mathrm{ft}}$ to decrease 
to zero at only a single point. Hence, if $f_\delta^{\mathrm{ft}}$ decreases to zero at different 
polynomial rates at different points, and if the fastest of these rates is $\kappa$, say, 
then the convergence rate of an arbitrary estimator $\hat f$ to $f_X$ can be no faster 
than $n^{-1/(2\kappa)}$. Also note that (4.10) remains valid if we replace $f_X \in \mathcal F^j_{\beta,C}$ 
by stronger smoothness assumptions, for instance, classes of supersmooth 
$f_X$ where $f_X^{\mathrm{ft}}$ shows exponential decay or densities whose Fourier transforms 
are compactly supported as long as the endpoint of their support is larger 
than $T$. 

Assumption (4.9) is weaker in other respects than (4.5), in particular, with 
regard to tail behavior of $f_\delta^{\mathrm{ft}}$. Although the second part of (4.9) involves an 
assumption on the derivative of $f_\delta^{\mathrm{ft}}$, that condition is a natural reflection of 
the first part of (4.9). 

We mention that slower minimax results are derived for the estimator 
(2.4) under classes of ordinary smooth densities which are weaker than $\mathcal F^2_{\beta,C}$ 
or $\mathcal F^3_{\beta,C}$; see the long version of this paper for details (Hall and Meister [19]). 



4.3. Adaptivity. Here we state an optimality result for data-driven selection 
of the smoothing parameter, $\hat\zeta$, defined at (3.3). For simplicity, we 
confine attention to the nonoscillatory case and a special oscillatory case. 

Theorem 4.4. Assume either the conditions of Theorem 4.1(a) with 
fx 6 J~pc> or conditions of Theorem 4.2(a) with fx £ Fpc UJ-pc> an ^ 
r>2V(2/z/), 4/x/3 < /3 + 2v + 1 and p < (2/i + z/)/(4/i-l). Then, with prob- 
ability 1, 

, x IliWxIl 2 

(4.12) s — s >l. 

inf c>0 £||/£-/x|| 2 

Note that the parameters $r$ and $\rho$ do not depend on $\beta$, and give the "scale" 
of smoothness classes to which the choice of $\zeta$ adapts. 

A proof of Theorem 4.4 is included in the long version of this paper (Hall 
and Meister [19]). 

In related work, although in the setting of kernel methods, Delaigle and 
Gijbels [9] compare plug-in and bootstrap methods for choosing the band- 
width, and Delaigle and Gijbels [10] suggest a bootstrap technique. The 
approaches discussed in both articles produce, like our cross-validation al- 
gorithm, asymptotic optimality. A major difference, however, is that in our 
setting, for our nonkernel method, the level of smoothness is not supposed 
known in advance. In this context, Theorem 4.4 shows that cross-validation 
can choose the degree of smoothness adaptively. By way of contrast, the 
selection of the kernel, in the case of a kernel method, is made in the light 
of the assumed level of smoothness of the target density, and theoretical 
arguments are predicated on that level being correct. 

5. Numerical properties. Here we present simulation results addressing 
the performance of our estimators. We give graphs in two cases, (a) the 
regular-smooth, nonoscillatory case, and (b) the regular-smooth, oscillatory 
case. 

In each setting, and for each parameter setting, we drew 100 random sam- 
ples and ranked them in order of the size of integrated squared error (ISE) . 
In each figure the five unbroken curves are density estimates correspond- 
ing to the largest, 25th largest, 50th largest, 75th largest and smallest value 
of ISE; the dashed curve depicts the true density, $f_X$. Sample size, $n$, is given 
below each figure. As discussed in Sections 3 and 4, in cases (a) and (b) we 
fixed $\rho$ at 0 and 2, respectively, and chose the only remaining smoothing 
parameter, $\zeta$ in the formula $h(t) = \zeta |t|^{\rho}$, by cross-validation. 

Numerical results in case (a), but using methods quite different from our 
own, are widely available in the literature. For recent examples, see Delaigle 
and Gijbels [9, 10] in the setting of kernel techniques, and Comte, Rozenholc 
and Taupin [6, 7] for penalization methods. Case (a) is also addressed in 
Figures 1-6, where we take $f_\delta$ to be the Laplace density. As the target density 
$f_X$ we used the bimodal $\frac12\{N(2, 1) + N(-2, 1)\}$ density for Figures 1 and 2, 
the two-fold convolution of the Laplace density for Figures 3 and 4, and the 
shifted $\chi^2$-density $f_X(x) = (1/16)(x + 4)^2 \exp\{-\frac12 (x + 4)\}$, for $x \ge -4$, for 
Figures 5 and 6. 

In Figures 7-12 we illustrate case (b), for the same respective densities 
$f_X$ as in case (a). In case (b), $f_\delta$ was the uniform density on $[-1, 1]$. 
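
The simulation designs described above can be reproduced along the following lines. This is a minimal Python sketch; the unit scale used for the Laplace densities, the seed and the function names are assumptions, since the text does not specify them. The shifted chi-square density has 6 degrees of freedom, so it is drawn as a Gamma variable with shape 3 and scale 2, minus 4.

```python
import math
import random


def chi2_shifted_pdf(x):
    """Shifted chi-square (6 d.f.) density (1/16)(x+4)^2 exp{-(x+4)/2}, x >= -4."""
    if x < -4.0:
        return 0.0
    u = x + 4.0
    return u * u * math.exp(-0.5 * u) / 16.0


def draw_x(rng, design):
    """One draw of X from one of the three target densities."""
    if design == "bimodal":        # (1/2){N(2,1) + N(-2,1)}
        return rng.gauss(2.0 if rng.random() < 0.5 else -2.0, 1.0)
    if design == "laplace2":       # two-fold Laplace convolution (unit scale assumed)
        return sum(rng.expovariate(1.0) - rng.expovariate(1.0) for _ in range(2))
    if design == "chi2":           # Gamma(shape 3, scale 2) shifted by -4
        return rng.gammavariate(3.0, 2.0) - 4.0
    raise ValueError(design)


def draw_error(rng, case, laplace_scale=1.0):
    """Case 'a': Laplace error (scale is an assumption); case 'b': Uniform[-1, 1]."""
    if case == "a":
        rate = 1.0 / laplace_scale
        return rng.expovariate(rate) - rng.expovariate(rate)
    return rng.uniform(-1.0, 1.0)


def sample_w(n, design, case, seed=0):
    """Contaminated sample W_j = X_j + delta_j, j = 1, ..., n."""
    rng = random.Random(seed)
    return [draw_x(rng, design) + draw_error(rng, case) for _ in range(n)]


print(sample_w(5, "bimodal", "a"))
```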

It can be deduced from the figures for cases (a) and (b) that: (i) our esti- 
mator has a little more difficulty with the bimodal density, compared with 
the unimodal ones; (ii) the estimator finds the nonoscillatory case (a) more 
challenging than the oscillatory case (b); and (iii) performance gradually 
improves with increasing sample size. Property (i) is to be expected, since 
case (a) is characterized by greater structure, which the estimator is pressed 
to discover; property (ii) is the result of $f_\delta^{\mathrm{ft}}$ decreasing more rapidly in the 
tails in case (a) than in case (b); and property (iii) reflects a steady decline 
in values of averaged integrated squared error as $n$ grows. Simulations in 
the case of supersmooth error, more precisely, $f_\delta = N(0, 1)$ and $f_\delta$ equal to 
the convolution of a normal and a uniform density, show that $n = 1000$ gives 
results broadly similar to those for $n = 400$ in case (b). 

Figures 13 and 14 show results for the kernel estimator in the setting of 
Figures 1 and 2. Sample sizes are $n = 400$ and 700 for the respective figures, 
the density in each case is the normal mixture defined above, and the 
smoothing parameter (this time, the bandwidth) was chosen 
by cross-validation. The kernel for these results has Fourier transform 
$(1 - t^2)^3$ for $|t| \le 1$; this choice is popular in kernel deconvolution. For both 
sample sizes the kernel estimator is more erratic than its ridge competitor, 
reflecting the fact that it has consistently higher values of average 
integrated squared error. This is observed in the case of the unimodal density 
too. 

Fig. 2. $n = 700$, $r = 2$, $\rho = 0$, $\zeta$ by CV; average integrated squared error = 0.0009. 

Fig. 4. $n = 700$, $r = 2$, $\rho = 0$, $\zeta$ by CV; average integrated squared error = 0.004. 

Fig. 5. $n = 400$, $r = 2$, $\rho = 0$, $\zeta$ by CV; average integrated squared error = 0.005. 

6. Outline proofs. In this section we write const. for a generic positive 
constant. 

6.1. Preparatory lemma. The following result can be proved by 
Parseval's identity and Fubini's theorem. Define $G_n = \{t \in \mathbb R : |f_\delta^{\mathrm{ft}}(t)|^2 < h(t)^2\}$, 
and write $G_n^c$ for the complement of $G_n$. 

Lemma 6.1. The mean integrated squared error of our estimator is 
bounded above by $V_n + B_n$ and by $V_{1,n} + V_{2,n} + B_n$, where 

    $V_n = \frac{1}{2\pi}\, n^{-1} \int |f_\delta^{\mathrm{ft}}(t)|^{2(r+1)} h(t)^{-2(r+2)}\, dt$, 

    $V_{1,n} = \frac{1}{2\pi}\, n^{-1} \int_{G_n} |f_\delta^{\mathrm{ft}}(t)|^{2(r+1)} h(t)^{-2(r+2)}\, dt$, 

    $V_{2,n} = \frac{1}{2\pi}\, n^{-1} \int_{G_n^c} |f_\delta^{\mathrm{ft}}(t)|^{-2}\, dt$, 

    $B_n = \frac{1}{2\pi} \sup_{f_X} \int_{G_n} |f_X^{\mathrm{ft}}(t)|^2\, dt$. 

The bounds $V_n + B_n$ and $V_{1,n} + V_{2,n} + B_n$ will be used to derive rates for 
supersmooth and ordinary-smooth error densities, respectively. In 
particular, Lemma 6.1 leads directly to Theorem 4.1. 

Fig. 6. $n = 700$, $r = 2$, $\rho = 0$, $\zeta$ by CV; average integrated squared error = 0.004. 

6.2. Proof of Theorem 4.2(a). Define $G_{n,j} = G_n \cap I_j$ and $I_j = [t_{j,-}, t_{j,+}]$, 
where $t_{j,\pm} = (j \pm \frac12)\pi/\lambda$. Since $G_n$ is symmetric about zero, and $G_{n,0}$ is 
empty for $n$ sufficiently large, we may restrict attention to $j \ge 1$. Now, for 
$t \in G_{n,j}$, 

    $|\sin(\lambda t)| \le C_1^{-1/\mu}\, n^{-\zeta/\mu}\, |t|^{(\rho+\nu)/\mu} = \varphi_{1,n}(t)$, 

say. It may be shown using a geometric argument that $G_{n,j} \subseteq [t'_{n,j,-}, t'_{n,j,+}]$, 
where, taking the plus and minus signs, respectively, $t'_{n,j,\pm}$ denotes the 
intersection of the horizontal line $y = \varphi_{1,n}(t_{j,+})$ and the line connecting the 
points $(j\pi/\lambda, 0)$ and $(t_{j,+}, 1)$ [$(j\pi/\lambda, 0)$ and $(t_{j,-}, 1)$]. Therefore, 

7T ,, / 1\(P+ i 0/M 



3 + 



where C* > 0. 
Fig. 8. $n = 700$, $r = 2$, $\rho = 2$, $\zeta$ by CV; average integrated squared error = 0.001. 

Fig. 10. $n = 700$, $r = 2$, $\rho = 2$, $\zeta$ by CV; average integrated squared error = 0.001. 

Fig. 11. $n = 400$, $r = 2$, $\rho = 2$, $\zeta$ by CV; average integrated squared error = 0.001. 



In the arguments immediately below, leading to a bound on $V_{1,n}$, we use 
the property $G_{n,j} \subseteq [t'_{n,j,-}, t'_{n,j,+}]$ for $j \le j'_n \asymp n^{\zeta/(\rho+\nu)}$, where $j'_n$ denotes the 
largest $j$ for which the inclusion relation holds. For larger values of $j$, we 
use the property $G_{n,j} \subseteq I_j$, thus obtaining 



2C(r+2)-lN 



£ / n ^ + { sin (At)} 2 ^+l)| f |-2(^2p)-2r(, +P ) dt 

3=1 n,3 



+ / {sinCAt)} 2 "^ 1 ' |t|-2(»H-2p)-2r(i/+p) di 

■?« rf' . , —jn/X 

£"f it' . -jV/A 

+ 0(n 2C(r+2)_1 j^ 1 - 2 ( 1/ + 2 P)-2r(^+p) ) 



Jn 

E 



< 0(n~ 1+c(2p ~ 1)/At ) ^ j-2p+(f+ v )/n + o( n -i+C(2^+i)/(^+p)^ 
Fig. 12. $n = 700$, $r = 2$, $\rho = 2$, $\zeta$ by CV; average integrated squared error = 0.0007. 



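From this bound it may be proved that 

    $V_{1,n} = \begin{cases} O(n^{-1+\zeta(2\mu-1)/\mu}), & \text{if } \rho > (\mu+\nu)/(2\mu-1), \\ O(n^{-1+\zeta(2\mu-1)/\mu} \log n), & \text{if } \rho = (\mu+\nu)/(2\mu-1), \\ O(n^{-1+\zeta(2\nu+1)/(\nu+\rho)}), & \text{otherwise.} \end{cases}$ 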



A similar argument can be used to bound the bias term B n : 

in r t' 



B n <- S u P j2 / l/$(*)l 2 *+o(& 



2/3 x 



At this point it is necessary to treat separately the cases fx 6 
j = 2,3. When j = 3, 



fx j=l 



0(n"^), ifp + zy<2/i/3, 

0(n - ^ log n), if p + v = 2(j,/3, 
^ n -2pc/(p+v)^ otherwise. 
dt. 
When j = 2, 

max |/|(t)| 2 ={ max |/|(i)| 2 - mm 
+ . min |/|(t)| 2 

< o(r 2/3 ) ^(i+miilWI 2 + '(*)!> 

n>3.— 

This leads to the same upper bound on B n as in the case j = 3. 

Finally we bound Vjjn- There, we need a lower bound for G n , obtained as 
follows. Then t 6 G n j is implied by 

|sin(At)| < C£ 1/ ' l n-Ui t \t\Wi i = <p2, n (t), 

say. Let t" n j _ < t" n ■ + denote the intersections of the horizontal line y = 
<P2,n(tj,-) and both tangent lines at t=jit/\ of the curve with equation 
y = | sin(Ai)|. Then, for sufficiently small j, [t'^j_, t'n,j,+] ^ Gn,ji an d so 

where C| > 0. 

Let i£ ~ n^/O^) denote the largest j for which [t^ d _,t^ + } C G nj -. We 
use this inclusion relation if j < j", and the relation /j C G n j otherwise. 
This leads to the property 

i" 

Jn 

[0,oo)nG^c|J(/ J \K J ._,^. + ])U/o. 

To bound V2, n we apply this formula to the integral over t € with i > 0. 
The contribution from Iq can be shown to equal 0(n _1 ), and so is negligible. 
Defining 1^ = Ij\K,j,-,€,j,+\> and the shifted set 7^- = [-vr/(2A), t'^_ - 
jir/X] U [t^j4- — J7r/A, 7r/(2A)], it may be proved that 



7 

^2,»<0(n -1 )$2 /„ |sin(Ai)r 2At N 2l/ di 

7'" 

^Oin-^Yj 2 " \sm(\t)\- 2 » dt 

A1 Jl ah 



= o( ?l -i+c(2/*-i)/>i)y J *+(p+ I /)(i-2) J )/ ( ( i 

This leads to the same upper bound for Vi n that we derived earlier for V\ n . 
Substituting for p and £ from (4.6) and (4.7), we obtain Theorem 4.2(a). 
6.3. Proof of Theorem 4.2(b). Using Lemma 6.1, it may be shown that 
$V_n = O(n^{-1+4\zeta})$. Note too that, for $t \in G_n$, 

(6.1)    $|\sin(\lambda t)| \le C_1^{-1/\mu}\, n^{-\zeta/\mu} \exp\{(d/\mu) |t|^{\gamma}\}$. 

Define $t_n = \{\zeta/(2d)\}^{1/\gamma} (\log n)^{1/\gamma}$. If $|t| \le t_n$, then $|\sin(\lambda t)| \le C_1^{-1/\mu} n^{-\zeta/(2\mu)}$, 
and hence, using (6.1), $|t - k\pi/\lambda| \le \{\pi C_1^{-1/\mu}/(2\lambda)\}\, n^{-\zeta/(2\mu)}$ for integers $k$. 
From these properties it may be proved that, for $f_X \in \mathcal F^j_{\beta,C}$ for $j = 1$, 2 or 3, 

    $B_n = O(t_n\, n^{-\zeta/(2\mu)} + t_n^{-2\beta}) = O\{(\log n)^{-2\beta/\gamma}\}$. 

6.4. Proof of Theorem 4.3. First we derive (4.10). We introduce the 
density $f_0(x) = (1 - \cos x)/(\pi x^2)$, having the "tent"-shaped Fourier transform, 
$f_0^{\mathrm{ft}}(t) = 1 - |t|$ for $|t| \le 1$, and the supersmooth Cauchy density $f_1(x) = 
(1/\pi)(1 + x^2)^{-1}$. Define too the densities 

    $f_{n,\theta}(x) = \frac12 \epsilon_n f_1(\epsilon_n x) + \frac12 \epsilon_n f_0(\epsilon_n x) \{1 + \eta\theta \cos(Tx)\}$, 

with rj 6 (0, j], G {0, 1} and a positive- valued sequence e n J, 0, to be defined 
later. The characteristic function corresponding to f n $ is 

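    $f_{n,\theta}^{\mathrm{ft}}(t) = \frac12 f_1^{\mathrm{ft}}(t/\epsilon_n) + \frac12 f_0^{\mathrm{ft}}(t/\epsilon_n) + \frac14 \eta\theta f_0^{\mathrm{ft}}\{(t + T)/\epsilon_n\} + \frac14 \eta\theta f_0^{\mathrm{ft}}\{(t - T)/\epsilon_n\}$. 
Using the fact that $\epsilon_n \downarrow 0$, it can be shown that if $\eta$ is sufficiently small, then 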

Write $f_1 * f_2$ for the convolution of functions $f_1$ and $f_2$. It was proved by 
Fan [12] that if the $\chi^2$-distance between the densities $f_{n,0} * f_\delta$ and $f_{n,1} * f_\delta$ 
satisfies 

(6.2)    $\chi^2(f_{n,1} * f_\delta,\, f_{n,0} * f_\delta) = \int \frac{|(f_{n,1} * f_\delta)(x) - (f_{n,0} * f_\delta)(x)|^2}{(f_{n,0} * f_\delta)(x)}\, dx = O(n^{-1})$, 



then, for a constant c> 0, for any estimator / and for all sufficiently large n, 

.0 



(6-3) sup ^HZ-Zll^cll/n,!-/^- 11 



It may be shown from the definition of f n ,o(x) that 



2/, . - , . f W 9g -l f {(/n,0 ~ Jn.l) * Ja}(g)f ^ 
X (7n,l * 7(5, 7n,0 * 7(5) < 2£„ / r— ; r fite. 

Also, if q is so large so that f\ y \ <q fs(y) dy > 0, 

{/l(e»0 */«}(*) >t _1 / / 5 (y){l + 24(x 2 + y 2 )}- 1 d 2/ 

%l<8 



> const. (1 + e 2 x 2 ) . 
Therefore, 

X 2 (/n,i * fs, fn,o * fs) < const. / {(f nfi - f n ,i) * fs} 2 {x){e~ l + e n x 2 ) dx. 



To bound the right-hand side, use the identity (/ ft )' = *{'/(')} ft and the 
Parseval identity to obtain 

X 2 (/n,l * fs, fnfi * fs) < Const. L 1 J " fn,lf(t)\ 2 \fl\t)\ 2 dt 

+Snj \{{fn,0-fn,l) lt }\t)\ 2 \ff{t)\ 2 dt 
+S n j \{fnfl-fn,lf{t)\ 2 \{ff)'{t)\ 2 dt . 

Therefore, using the fact that (/ n ,o — /n,i) ft and {(f n ,o — /n,i) ft }' are both 
supported on [— T — e n , — T+e n ] U [T— e n ,T+e n ], we may show that x 2 (/ n ,i * 
fs,fn,o * fs) equals 

ole- 1 f T+£n \fl\ t )\ 2 dt + e n f T+£ " \(ff)'(t)\ 2 dt + e- 1 \ff(t)\ 2 dt). 

I JT-e n JT-e n JT-e„ J 

Using (4.9) to bound the right-hand side, we may thus show that x 2 {fn,i * 
fs, fnfi * fs) = 0(^n M )- Hence, choosing e n = n, - VCw guarantees the validity 
of (6.3). Again, Parseval's identity may be used to prove that 

\\fn,l ~ fnflW 2 > Const. £ n , 

which implies (4.10). 

Finally we turn to (4.11). With the densities /o, /i as above, we construct 
the subclass of densities 

f n , e (x) = l{fi(x) + fo(x)} + const. ^■j" /3 " (1/2) cos(2ix)/ (x), 

with k n an integer satisfying k n | oo, and 6*j 6 {0, 1}. The class contains 

all densities / n ,e- By modeling the 0j's as independent random variables with 
P{9j = 0) = i, the lower bound (4.11) may be established using arguments 
similar to those in the proof of Fan [15]. 